Assignment 1¶
1. Familiarization with the dataset description, selection of an appropriate data range, exploratory data analysis.¶
Familiarization with the dataset description
The data from the 4th year of the forecasting period were chosen for analysis and processing, because that subset contains the most companies that went bankrupt.
- The data contain financial ratios from the 4th year of the forecasting period and a corresponding class label indicating bankruptcy status after 2 years.
- The data contain 9792 instances (financial statements): 515 represent bankrupt companies and 9277 represent companies that did not go bankrupt in the forecasting period.
from scipy.io import arff
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 200)
file_path= 'pcbd/4year.arff'
data, meta = arff.loadarff(file_path)
print(meta)
Dataset: '1year-weka.filters.unsupervised.instance.SubsetByExpression-Enot
Attr1's type is numeric
Attr2's type is numeric
Attr3's type is numeric
Attr4's type is numeric
Attr5's type is numeric
Attr6's type is numeric
Attr7's type is numeric
Attr8's type is numeric
Attr9's type is numeric
Attr10's type is numeric
Attr11's type is numeric
Attr12's type is numeric
Attr13's type is numeric
Attr14's type is numeric
Attr15's type is numeric
Attr16's type is numeric
Attr17's type is numeric
Attr18's type is numeric
Attr19's type is numeric
Attr20's type is numeric
Attr21's type is numeric
Attr22's type is numeric
Attr23's type is numeric
Attr24's type is numeric
Attr25's type is numeric
Attr26's type is numeric
Attr27's type is numeric
Attr28's type is numeric
Attr29's type is numeric
Attr30's type is numeric
Attr31's type is numeric
Attr32's type is numeric
Attr33's type is numeric
Attr34's type is numeric
Attr35's type is numeric
Attr36's type is numeric
Attr37's type is numeric
Attr38's type is numeric
Attr39's type is numeric
Attr40's type is numeric
Attr41's type is numeric
Attr42's type is numeric
Attr43's type is numeric
Attr44's type is numeric
Attr45's type is numeric
Attr46's type is numeric
Attr47's type is numeric
Attr48's type is numeric
Attr49's type is numeric
Attr50's type is numeric
Attr51's type is numeric
Attr52's type is numeric
Attr53's type is numeric
Attr54's type is numeric
Attr55's type is numeric
Attr56's type is numeric
Attr57's type is numeric
Attr58's type is numeric
Attr59's type is numeric
Attr60's type is numeric
Attr61's type is numeric
Attr62's type is numeric
Attr63's type is numeric
Attr64's type is numeric
class's type is nominal, range is ('0', '1')
The class attribute is the class label indicating bankruptcy status after 2 years. A value of 1 means the company went bankrupt, and a value of 0 means it did not.
Since class is a nominal attribute, computing its mean or median makes no sense.
All other attributes are numeric. According to the second source, some attributes are integer-valued and the rest are real-valued: A6 and A59 are integer attributes, the rest are real.
- X6 retained earnings / total assets
- X59 long-term liabilities / equity
df = pd.DataFrame(data)
df.describe()
| | Attr1 | Attr2 | Attr3 | Attr4 | Attr5 | Attr6 | Attr7 | Attr8 | Attr9 | Attr10 | Attr11 | Attr12 | Attr13 | Attr14 | Attr15 | Attr16 | Attr17 | Attr18 | Attr19 | Attr20 | Attr21 | Attr22 | Attr23 | Attr24 | Attr25 | Attr26 | Attr27 | Attr28 | Attr29 | Attr30 | Attr31 | Attr32 | Attr33 | Attr34 | Attr35 | Attr36 | Attr37 | Attr38 | Attr39 | Attr40 | Attr41 | Attr42 | Attr43 | Attr44 | Attr45 | Attr46 | Attr47 | Attr48 | Attr49 | Attr50 | Attr51 | Attr52 | Attr53 | Attr54 | Attr55 | Attr56 | Attr57 | Attr58 | Attr59 | Attr60 | Attr61 | Attr62 | Attr63 | Attr64 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9791.000000 | 9791.000000 | 9791.000000 | 9749.000000 | 9.771000e+03 | 9791.000000 | 9791.000000 | 9773.000000 | 9792.000000 | 9791.000000 | 9791.000000 | 9749.000000 | 9771.000000 | 9791.000000 | 9.784000e+03 | 9773.000000 | 9773.000000 | 9791.000000 | 9771.000000 | 9771.000000 | 9634.000000 | 9791.000000 | 9771.000000 | 9581.000000 | 9791.000000 | 9773.000000 | 9.151000e+03 | 9561.000000 | 9791.000000 | 9771.000000 | 9771.000000 | 9696.000000 | 9749.000000 | 9773.000000 | 9791.000000 | 9791.000000 | 5350.000000 | 9791.000000 | 9771.000000 | 9749.000000 | 9605.000000 | 9771.000000 | 9.771000e+03 | 9.771000e+03 | 9179.000000 | 9749.000000 | 9719.000000 | 9791.000000 | 9771.000000 | 9773.000000 | 9791.000000 | 9716.000000 | 9561.000000 | 9561.000000 | 9.792000e+03 | 9771.000000 | 9791.000000 | 9776.000000 | 9791.000000 | 9178.000000 | 9760.000000 | 9.771000e+03 | 9749.000000 | 9561.000000 |
| mean | 0.043019 | 0.596404 | 0.130959 | 8.136600 | 6.465164e+01 | -0.059273 | 0.059446 | 19.884016 | 1.882296 | 0.389040 | 0.075417 | 0.210989 | 0.398902 | 0.059460 | 3.017681e+03 | 0.617918 | 20.976033 | 0.064580 | -0.019081 | 62.704589 | 1.218724 | 0.066203 | -0.070364 | 0.247742 | 0.222839 | 0.451115 | 1.115883e+03 | 6.725180 | 3.946479 | 5.353531 | 0.041258 | 341.625124 | 8.445313 | 4.979157 | 0.058091 | 2.077261 | 70.659877 | 0.487190 | -1.072578 | 3.064235 | 0.968902 | -0.371479 | 7.356944e+02 | 6.729892e+02 | 5.458024 | 7.274189 | 112.989701 | -0.002370 | -0.517222 | 7.085001 | 0.469319 | 10.031638 | 6.114681 | 7.402928 | 7.686330e+03 | -0.992263 | 0.035022 | 1.133287 | 0.856053 | 118.156064 | 25.194430 | 2.015157e+03 | 8.660813 | 35.949619 |
| std | 0.359321 | 4.587122 | 4.559074 | 290.647281 | 1.475939e+04 | 6.812754 | 0.533344 | 698.697015 | 17.674650 | 4.590299 | 0.528232 | 74.237274 | 37.974787 | 0.533344 | 1.022731e+05 | 78.494223 | 698.757245 | 0.736143 | 25.583613 | 377.204157 | 5.930840 | 0.504481 | 23.889882 | 8.268015 | 4.852418 | 74.037751 | 3.143938e+04 | 147.963574 | 0.865714 | 340.974268 | 25.585724 | 6145.604519 | 69.690183 | 58.480776 | 0.483463 | 17.341615 | 621.311292 | 4.578432 | 77.056762 | 87.916989 | 41.191681 | 14.174896 | 3.283705e+04 | 3.281128e+04 | 186.414617 | 290.619843 | 1993.125597 | 0.525467 | 15.737098 | 287.770829 | 4.554869 | 897.307846 | 90.190534 | 146.013868 | 7.605261e+04 | 77.007971 | 8.945365 | 8.038201 | 26.393305 | 3230.316692 | 1099.260821 | 1.171461e+05 | 60.838202 | 483.318623 |
| min | -12.458000 | 0.000000 | -445.910000 | -0.045319 | -3.794600e+05 | -486.820000 | -12.458000 | -1.848200 | -0.032371 | -445.910000 | -12.244000 | -6331.800000 | -1460.600000 | -12.458000 | -1.567500e+06 | -6331.800000 | 0.000857 | -12.458000 | -1578.700000 | 0.000000 | -1.146300 | -12.244000 | -1578.700000 | -314.370000 | -466.340000 | -6331.800000 | -2.590100e+05 | -990.020000 | -0.440090 | -4940.000000 | -1495.600000 | 0.000000 | 0.000000 | -756.500000 | -9.043100 | -0.000014 | -3.715000 | -445.910000 | -7522.000000 | -8.833300 | -1086.800000 | -719.800000 | -1.158700e+05 | -1.158700e+05 | -2834.900000 | -6.639200 | -3.630700 | -13.815000 | -837.860000 | -0.045239 | 0.000000 | 0.000000 | -1033.700000 | -1033.700000 | -7.132200e+05 | -7522.100000 | -597.420000 | -30.892000 | -284.380000 | 0.000000 | -12.656000 | -1.496500e+04 | -0.024390 | -0.000015 |
| 25% | 0.001321 | 0.263145 | 0.020377 | 1.047000 | -5.121700e+01 | -0.000578 | 0.003004 | 0.428300 | 1.006675 | 0.294440 | 0.009457 | 0.007608 | 0.021204 | 0.003008 | 2.173550e+02 | 0.061874 | 1.448900 | 0.003008 | 0.001937 | 15.244500 | 0.920125 | 0.000000 | 0.000963 | 0.004049 | 0.135600 | 0.057772 | 0.000000e+00 | 0.035097 | 3.398450 | 0.082730 | 0.004434 | 47.222750 | 2.702300 | 0.297400 | 0.000872 | 1.030900 | 1.097050 | 0.419750 | 0.000687 | 0.051234 | 0.025029 | 0.000000 | 6.890100e+01 | 3.651950e+01 | 0.010131 | 0.613890 | 15.836000 | -0.047398 | -0.035386 | 0.768760 | 0.185285 | 0.128978 | 0.684900 | 0.946990 | 2.184000e+01 | 0.003121 | 0.008768 | 0.885722 | 0.000000 | 5.356325 | 4.267700 | 4.323400e+01 | 2.938800 | 2.012900 |
| 50% | 0.041364 | 0.467740 | 0.199290 | 1.591800 | -5.557600e-02 | 0.000000 | 0.048820 | 1.088700 | 1.161300 | 0.510450 | 0.062544 | 0.143690 | 0.063829 | 0.048859 | 9.065550e+02 | 0.219130 | 2.134600 | 0.048859 | 0.030967 | 35.657000 | 1.045700 | 0.050118 | 0.025931 | 0.149150 | 0.386750 | 0.199240 | 1.005500e+00 | 0.470770 | 3.976400 | 0.226950 | 0.037939 | 80.884500 | 4.467900 | 1.975500 | 0.046855 | 1.559400 | 3.110400 | 0.613270 | 0.030427 | 0.182500 | 0.089499 | 0.032384 | 1.034800e+02 | 5.786200e+01 | 0.232560 | 1.041100 | 38.482000 | 0.006990 | 0.004644 | 1.230000 | 0.336680 | 0.221380 | 1.211800 | 1.378300 | 9.503300e+02 | 0.043679 | 0.098026 | 0.958305 | 0.002129 | 9.482000 | 6.283550 | 7.472900e+01 | 4.848900 | 4.041600 |
| 75% | 0.111130 | 0.689255 | 0.410670 | 2.880400 | 5.573200e+01 | 0.065322 | 0.126940 | 2.691000 | 1.970225 | 0.714290 | 0.140805 | 0.513820 | 0.127935 | 0.126960 | 2.412500e+03 | 0.599580 | 3.781500 | 0.126960 | 0.083315 | 65.667500 | 1.208925 | 0.126675 | 0.071692 | 0.360600 | 0.614030 | 0.542190 | 5.236100e+00 | 1.570000 | 4.498550 | 0.434085 | 0.093553 | 133.202500 | 7.627100 | 4.509700 | 0.126375 | 2.284250 | 12.239750 | 0.777415 | 0.080855 | 0.666480 | 0.215980 | 0.081628 | 1.481050e+02 | 8.510950e+01 | 0.816100 | 1.971200 | 70.936000 | 0.084418 | 0.051082 | 2.268800 | 0.530775 | 0.364155 | 2.274200 | 2.426300 | 4.694550e+03 | 0.117170 | 0.242680 | 0.996163 | 0.211790 | 19.506000 | 9.938200 | 1.233450e+02 | 8.363800 | 9.413500 |
| max | 20.482000 | 446.910000 | 22.769000 | 27146.000000 | 1.034100e+06 | 322.200000 | 38.618000 | 53209.000000 | 1704.800000 | 12.602000 | 38.618000 | 3340.900000 | 2707.700000 | 38.618000 | 8.085500e+06 | 4401.300000 | 53210.000000 | 50.266000 | 1082.600000 | 26606.000000 | 396.160000 | 38.618000 | 879.860000 | 400.590000 | 12.602000 | 3594.600000 | 2.037300e+06 | 11864.000000 | 9.651800 | 29526.000000 | 1083.100000 | 385590.000000 | 5534.100000 | 4260.200000 | 38.618000 | 1704.800000 | 24487.000000 | 12.602000 | 112.020000 | 8007.100000 | 3443.400000 | 160.110000 | 3.020000e+06 | 3.020000e+06 | 10337.000000 | 27146.000000 | 140990.000000 | 33.535000 | 107.680000 | 27146.000000 | 446.910000 | 88433.000000 | 4784.100000 | 11678.000000 | 6.123700e+06 | 112.020000 | 226.760000 | 668.750000 | 1661.000000 | 251570.000000 | 108000.000000 | 1.077900e+07 | 5662.400000 | 21153.000000 |
sns.countplot(x='class', data=df, hue='class')
plt.title('Class Distribution')
plt.legend(loc='upper right', title='class', labels=['Non-bankrupt', 'Bankrupt'])
plt.show()
Observation:
The class distribution is clearly imbalanced: non-bankrupt companies vastly outnumber bankrupt ones.
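The imbalance is worth quantifying before any modeling. A minimal sketch using the class counts reported above (figures taken from the dataset description, not recomputed from `df`):

```python
import pandas as pd

# Class counts as reported for the 4th-year data
counts = pd.Series({'0': 9277, '1': 515})
ratio = counts['0'] / counts['1']
print(f"Imbalance ratio (non-bankrupt : bankrupt) is about {ratio:.1f} : 1")
print(f"Minority class share: {counts['1'] / counts.sum():.1%}")
```

With roughly 18 non-bankrupt firms per bankrupt one, plain accuracy would be a misleading evaluation metric later on.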
# visualize the missing values of each feature as a heatmap
plt.figure(figsize=(16, 10))
sns.heatmap(df.isnull(), cbar=False)
plt.title('Missing Values')
plt.xlabel('Features')
plt.ylabel('Data Points')
plt.show()
df.isnull().sum().sort_values(ascending=False)
Attr37 4442 Attr27 641 Attr60 614 Attr45 613 Attr64 231 Attr53 231 Attr28 231 Attr54 231 Attr24 211 Attr41 187 Attr21 158 Attr32 96 Attr52 76 Attr47 73 Attr33 43 Attr46 43 Attr4 43 Attr40 43 Attr63 43 Attr12 43 Attr61 32 Attr30 21 Attr44 21 Attr43 21 Attr42 21 Attr5 21 Attr39 21 Attr49 21 Attr62 21 Attr31 21 Attr20 21 Attr19 21 Attr13 21 Attr23 21 Attr56 21 Attr16 19 Attr17 19 Attr34 19 Attr50 19 Attr26 19 Attr8 19 Attr58 16 Attr15 8 Attr51 1 Attr57 1 Attr59 1 Attr48 1 Attr1 1 Attr38 1 Attr36 1 Attr3 1 Attr6 1 Attr7 1 Attr10 1 Attr11 1 Attr14 1 Attr18 1 Attr22 1 Attr25 1 Attr29 1 Attr2 1 Attr35 1 Attr55 0 Attr9 0 class 0 dtype: int64
After listing the missing-value counts and sorting them in descending order, we can see that attribute 37 has the most missing values: 4442, which is exactly 4442/9792 = 45.4% of the records. Attribute 37 was dropped on this basis, since too large a fraction of its values is missing.
The next attribute, 27, has 641 missing values, i.e. 641/9792 = 6.5%. Since this is a small fraction, we first check whether these records are nonetheless important for our analysis.
print(df[df['Attr27'].isnull()]['class'].value_counts())
class b'0' 487 b'1' 154 Name: count, dtype: int64
It turns out that the records missing attribute 27 carry as many as 154 bankruptcy labels, which is 154/515 = 29% of all bankrupt cases in the whole dataset, so we will not remove these records.
Next, the remaining attributes with missing values:
print(df[df['Attr60'].isnull()]['class'].value_counts())
class b'0' 561 b'1' 53 Name: count, dtype: int64
print(df[df['Attr45'].isnull()]['class'].value_counts())
class b'0' 560 b'1' 53 Name: count, dtype: int64
print(df[df['Attr64'].isnull()]['class'].value_counts())
class b'0' 203 b'1' 28 Name: count, dtype: int64
These attributes also contain records associated with bankruptcy, so we will neither remove the records nor drop the columns; we will only impute the missing values.
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10,35))
sns.boxplot(data=df, orient="h")
plt.title('Boxplot of Features')
plt.xlabel('Values')
plt.ylabel('Features')
plt.show()
This set of box plots immediately shows that we are dealing with outliers in the individual attributes.
Moreover, the boxes themselves are practically invisible, which indicates extreme outliers in every attribute.
plt.figure(figsize=(10,5))
sns.boxplot(data=df['Attr1'], orient="h")
plt.title('Attr1 Boxplot')
plt.xlabel('Values')
plt.ylabel('Attr1')
plt.show()
plt.figure(figsize=(10,5))
sns.boxplot(data=df['Attr11'], orient="h")
plt.title('Attr11 Boxplot')
plt.xlabel('Values')
plt.ylabel('Attr11')
plt.show()
Even for example attributes with a low standard deviation, such as attribute 1, outliers are clearly present; the boxes are somewhat more visible here, but still narrow.
2. Cleaning the dataset: missing values, negative values, differing attribute scales (normalization, standardization). See references [4, 5].¶
Attribute 37 was dropped because 45.4% of its values are missing, which is too large a fraction to impute.
df = df.drop('Attr27', axis=1)
The remaining attributes have at most 641/9792 = 6.5% missing values, which is a small fraction, so we decided to impute the missing values.
missing_columns = df.columns[df.isnull().any()].tolist()
from sklearn.impute import SimpleImputer
attributes = missing_columns
for attr in attributes:
    skewness = df[attr].skew()
    # mean for roughly symmetric distributions, median for skewed ones
    imputer = SimpleImputer(strategy='mean' if abs(skewness) < 0.5 else 'median')
    df[attr] = imputer.fit_transform(df[[attr]])
    if abs(skewness) < 0.5:
        print(f"Attribute {attr} is symmetric. Skewness: {skewness}. Missing values imputed with the mean.")
    else:
        if skewness > 0:
            print(f"Attribute {attr} is right-skewed. Skewness: {skewness}. Missing values imputed with the median.")
        else:
            print(f"Attribute {attr} is left-skewed. Skewness: {skewness}. Missing values imputed with the median.")
Attribute Attr1 is right-skewed. Skewness: 13.467518228865412. Missing values imputed with the median.
Attribute Attr2 is right-skewed. Skewness: 94.23602939482838. Missing values imputed with the median.
Attribute Attr3 is left-skewed. Skewness: -95.70656988633446. Missing values imputed with the median.
Attribute Attr4 is right-skewed. Skewness: 86.06915280975713. Missing values imputed with the median.
Attribute Attr5 is right-skewed. Skewness: 43.835753902533675. Missing values imputed with the median.
Attribute Attr6 is left-skewed. Skewness: -21.697536070930056. Missing values imputed with the median.
Attribute Attr7 is right-skewed. Skewness: 42.749747697328274. Missing values imputed with the median.
Attribute Attr8 is right-skewed. Skewness: 58.85986014465329. Missing values imputed with the median.
Attribute Attr10 is left-skewed. Skewness: -94.02810740094908. Missing values imputed with the median.
Attribute Attr11 is right-skewed. Skewness: 44.26663626067478. Missing values imputed with the median.
Attribute Attr12 is left-skewed. Skewness: -54.69677552071819. Missing values imputed with the median.
Attribute Attr13 is right-skewed. Skewness: 36.073219282723755. Missing values imputed with the median.
Attribute Attr14 is right-skewed. Skewness: 42.74961976122516. Missing values imputed with the median.
Attribute Attr15 is right-skewed. Skewness: 55.47711843228474. Missing values imputed with the median.
Attribute Attr16 is left-skewed. Skewness: -35.628273578296714. Missing values imputed with the median.
Attribute Attr17 is right-skewed. Skewness: 58.84945396150125. Missing values imputed with the median.
Attribute Attr18 is right-skewed. Skewness: 48.64885332661505. Missing values imputed with the median.
Attribute Attr19 is left-skewed. Skewness: -12.356586608052332. Missing values imputed with the median.
Attribute Attr20 is right-skewed. Skewness: 48.70550707367376. Missing values imputed with the median.
Attribute Attr21 is right-skewed. Skewness: 56.5556255643253. Missing values imputed with the median.
Attribute Attr22 is right-skewed. Skewness: 46.26921466931942. Missing values imputed with the median.
Attribute Attr23 is left-skewed. Skewness: -24.09197817879497. Missing values imputed with the median.
Attribute Attr24 is right-skewed. Skewness: 14.60678650386577. Missing values imputed with the median.
Attribute Attr25 is left-skewed. Skewness: -91.08910857612695. Missing values imputed with the median.
Attribute Attr26 is left-skewed. Skewness: -52.2761087355474. Missing values imputed with the median.
Attribute Attr28 is right-skewed. Skewness: 60.338643118714984. Missing values imputed with the median.
Attribute Attr29 is symmetric. Skewness: -0.09216803975722088. Missing values imputed with the mean.
Attribute Attr30 is right-skewed. Skewness: 70.68294257224264. Missing values imputed with the median.
Attribute Attr31 is left-skewed. Skewness: -5.281146969351796. Missing values imputed with the median.
Attribute Attr32 is right-skewed. Skewness: 45.54528427755971. Missing values imputed with the median.
Attribute Attr33 is right-skewed. Skewness: 66.02578140181893. Missing values imputed with the median.
Attribute Attr34 is right-skewed. Skewness: 63.65522769545239. Missing values imputed with the median.
Attribute Attr35 is right-skewed. Skewness: 53.0930250381841. Missing values imputed with the median.
Attribute Attr36 is right-skewed. Skewness: 96.74860440312183. Missing values imputed with the median.
Attribute Attr37 is right-skewed. Skewness: 23.70329591828747. Missing values imputed with the median.
Attribute Attr38 is left-skewed. Skewness: -94.8067253633291. Missing values imputed with the median.
Attribute Attr39 is left-skewed. Skewness: -95.36281584763287. Missing values imputed with the median.
Attribute Attr40 is right-skewed. Skewness: 80.15908074330225. Missing values imputed with the median.
Attribute Attr41 is right-skewed. Skewness: 60.62886605451784. Missing values imputed with the median.
Attribute Attr42 is left-skewed. Skewness: -40.35845415459756. Missing values imputed with the median.
Attribute Attr43 is right-skewed. Skewness: 82.0557315870161. Missing values imputed with the median.
Attribute Attr44 is right-skewed. Skewness: 82.21184494133858. Missing values imputed with the median.
Attribute Attr45 is right-skewed. Skewness: 44.45412252522891. Missing values imputed with the median.
Attribute Attr46 is right-skewed. Skewness: 86.10213600771331. Missing values imputed with the median.
Attribute Attr47 is right-skewed. Skewness: 52.18426409036969. Missing values imputed with the median.
Attribute Attr48 is right-skewed. Skewness: 25.33432634199644. Missing values imputed with the median.
Attribute Attr49 is left-skewed. Skewness: -39.19660551283668. Missing values imputed with the median.
Attribute Attr50 is right-skewed. Skewness: 88.22880782942197. Missing values imputed with the median.
Attribute Attr51 is right-skewed. Skewness: 96.24612936138438. Missing values imputed with the median.
Attribute Attr52 is right-skewed. Skewness: 98.51871649091412. Missing values imputed with the median.
Attribute Attr53 is right-skewed. Skewness: 32.90539820293406. Missing values imputed with the median.
Attribute Attr54 is right-skewed. Skewness: 60.128249480894056. Missing values imputed with the median.
Attribute Attr56 is left-skewed. Skewness: -95.54692007280859. Missing values imputed with the median.
Attribute Attr57 is left-skewed. Skewness: -45.61710531854953. Missing values imputed with the median.
Attribute Attr58 is right-skewed. Skewness: 66.37566701529. Missing values imputed with the median.
Attribute Attr59 is right-skewed. Skewness: 48.447640120864264. Missing values imputed with the median.
Attribute Attr60 is right-skewed. Skewness: 65.49077002364425. Missing values imputed with the median.
Attribute Attr61 is right-skewed. Skewness: 97.17717655626959. Missing values imputed with the median.
Attribute Attr62 is right-skewed. Skewness: 83.68270608250617. Missing values imputed with the median.
Attribute Attr63 is right-skewed. Skewness: 83.75994654229625. Missing values imputed with the median.
Attribute Attr64 is right-skewed. Skewness: 32.61179168566096. Missing values imputed with the median.
plt.figure(figsize=(16, 10))
sns.heatmap(df.isnull(), cbar=False)
plt.title('Missing Values after Imputation')
plt.xlabel('Features')
plt.ylabel('Data Points')
plt.show()
df.isnull().sum().sort_values(ascending=False)
Attr1 0 Attr2 0 Attr36 0 Attr37 0 Attr38 0 Attr39 0 Attr40 0 Attr41 0 Attr42 0 Attr43 0 Attr44 0 Attr45 0 Attr46 0 Attr47 0 Attr48 0 Attr49 0 Attr50 0 Attr51 0 Attr52 0 Attr53 0 Attr54 0 Attr55 0 Attr56 0 Attr57 0 Attr58 0 Attr59 0 Attr60 0 Attr61 0 Attr62 0 Attr63 0 Attr64 0 Attr35 0 Attr34 0 Attr33 0 Attr16 0 Attr3 0 Attr4 0 Attr5 0 Attr6 0 Attr7 0 Attr8 0 Attr9 0 Attr10 0 Attr11 0 Attr12 0 Attr13 0 Attr14 0 Attr15 0 Attr17 0 Attr32 0 Attr18 0 Attr19 0 Attr20 0 Attr21 0 Attr22 0 Attr23 0 Attr24 0 Attr25 0 Attr26 0 Attr28 0 Attr29 0 Attr30 0 Attr31 0 class 0 dtype: int64
Checking whether the data still contain any missing values.
Note: negative values¶
After examining the origin of the negative values in the individual attributes, we conclude that they are not invalid: these are financial ratios that can legitimately be negative. For example, net profit can be negative, which denotes a net loss; the same holds for other financial indicators.
If the dataset is to be used for classification, keep in mind that negative values can affect the results, since some classification models do not handle them well. Normalization or standardization can help address this, although only normalization to the [0, 1] range actually removes negative values; standardization preserves their sign.
Normalization vs standardization¶
How does normalization differ from standardization?
Normalization and standardization are two different ways of transforming attribute values onto a common scale.
Normalization rescales attribute values into the range from 0 to 1 (the MinMaxScaler variant in scikit-learn).
Standardization transforms attribute values so that they have mean 0 and standard deviation 1 (StandardScaler in scikit-learn).
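The difference is easy to verify on a toy column (a sketch on made-up numbers, not on `df`):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[-5.0], [0.0], [5.0], [20.0]])  # a column with negatives and an outlier

x_norm = MinMaxScaler().fit_transform(x)   # rescaled into [0, 1]
x_std = StandardScaler().fit_transform(x)  # mean 0, standard deviation 1

print(x_norm.ravel())             # smallest value maps to 0, largest to 1
print(x_std.mean(), x_std.std())  # approximately 0 and 1
```

Note that after min-max normalization no negative values remain, while standardization preserves the sign of each deviation from the mean.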
from sklearn.preprocessing import StandardScaler, MinMaxScaler
df_normalized = df.copy()
df_standardized = df.copy()
df_normalized = df_normalized.drop('class', axis=1)
df_standardized = df_standardized.drop('class', axis=1)
scaler = StandardScaler()
normalizer = MinMaxScaler()
df_normalized[df_normalized.columns] = normalizer.fit_transform(df_normalized)
df_standardized[df_standardized.columns] = scaler.fit_transform(df_standardized)
df_normalized['class'] = df['class']
df_standardized['class'] = df['class']
3. Outlier analysis using e.g. Z-score or one of the Outlier Detection algorithms. See references [6].¶
The Isolation Forest algorithm, one of the standard outlier-detection methods, was used for this analysis.
Isolation Forest detects outliers using an ensemble of random trees: it works by isolating anomalous points from the rest of the data, which takes fewer random splits for outliers than for normal points.
This makes the algorithm fast and effective at detecting outliers.
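The isolation principle can be illustrated on synthetic data: points far from the dense cluster are isolated in few random splits and get flagged. A sketch with made-up points, independent of the bankruptcy data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 1, size=(98, 2)),   # dense cluster of normal points
    [[8.0, 8.0], [-9.0, 7.0]],        # two clearly anomalous points
])
clf = IsolationForest(random_state=0, contamination=0.02).fit(X)
pred = clf.predict(X)                 # -1 = outlier, 1 = inlier
print("flagged as outliers:", int((pred == -1).sum()))
```

The two injected points are isolated almost immediately and receive the lowest anomaly scores.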
from sklearn.ensemble import IsolationForest
clf = IsolationForest(random_state=0, contamination=0.1)  # contamination=0.1 means we expect about 10% of the observations to be outliers
clf.fit(df_normalized)
y_pred = clf.predict(df_normalized)
df_norm_outliers = df_normalized.copy()
df_norm_outliers['outliers'] = y_pred
df_norm_outliers['outliers'].value_counts()
outliers = df_norm_outliers[df_norm_outliers['outliers'] == -1]
inliers = df_norm_outliers[df_norm_outliers['outliers'] == 1]
print(f"Number of outliers: {outliers.shape[0]}")
print(f"Number of inliers: {inliers.shape[0]}")
print(df_norm_outliers[df_norm_outliers['outliers'] == -1]['class'].value_counts())
df_norm_outliers = df_norm_outliers[df_norm_outliers['outliers'] == 1]
df_norm_outliers = df_norm_outliers.drop('outliers', axis=1)
Number of outliers: 980 Number of inliers: 8812 class b'0' 854 b'1' 126 Name: count, dtype: int64
Removing outliers here can be problematic, because among them are companies that went bankrupt, and bankruptcy is precisely our classification target. By removing outliers, we may discard records that are crucial for classification.
clf = IsolationForest(random_state=0, contamination=0.1)
clf.fit(df_standardized)
y_pred = clf.predict(df_standardized)
df_std_outliers = df_standardized.copy()
df_std_outliers['outliers'] = y_pred
df_std_outliers['outliers'].value_counts()
outliers = df_std_outliers[df_std_outliers['outliers'] == -1]
inliers = df_std_outliers[df_std_outliers['outliers'] == 1]
print(f"Number of outliers: {outliers.shape[0]}")
print(f"Number of inliers: {inliers.shape[0]}")
print(df_std_outliers[df_std_outliers['outliers'] == -1]['class'].value_counts())
df_std_outliers = df_std_outliers[df_std_outliers['outliers'] == 1]
df_std_outliers = df_std_outliers.drop('outliers', axis=1)
Number of outliers: 980 Number of inliers: 8812 class b'0' 854 b'1' 126 Name: count, dtype: int64
For the outlier-detection mechanism, it makes no difference which rescaling was used.
Another option would be to remove outliers only from the more frequent class.
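That idea can be sketched as a helper that runs Isolation Forest only on the majority class, so no bankrupt record is ever discarded. The function below is an illustration on toy data; the name `drop_majority_outliers` is a hypothetical helper (the byte labels `b'0'`/`b'1'` follow this notebook's conventions):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

def drop_majority_outliers(df, label_col='class', majority_label=b'0',
                           contamination=0.1, random_state=0):
    """Drop outliers detected within the majority class only,
    leaving every minority-class record untouched."""
    majority = df[df[label_col] == majority_label]
    minority = df[df[label_col] != majority_label]
    features = majority.drop(columns=[label_col])
    pred = IsolationForest(random_state=random_state,
                           contamination=contamination).fit_predict(features)
    return pd.concat([majority[pred == 1], minority])

# toy demonstration: 180 "majority" rows, 20 "minority" rows
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=['a', 'b', 'c'])
toy['class'] = [b'0'] * 180 + [b'1'] * 20
cleaned = drop_majority_outliers(toy)
print(len(cleaned), int((cleaned['class'] == b'1').sum()))
```

Roughly 10% of the majority rows are dropped, while all 20 minority rows survive.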
fig, axes = plt.subplots(1, 2, figsize=(20, 35))
sns.boxplot(data=df_normalized, orient="h", ax=axes[0])
sns.boxplot(data=df_norm_outliers, orient="h", ax=axes[1])
axes[0].set_title('Normalized with outliers (contamination=0.1)')
axes[1].set_title('Normalized without outliers (contamination=0.1)')
plt.show()
fig, axes = plt.subplots(1, 2, figsize=(20, 35))
sns.boxplot(data=df_standardized, orient="h", ax=axes[0])
sns.boxplot(data=df_std_outliers, orient="h", ax=axes[1])
axes[0].set_title('Standardized with outliers (contamination=0.1)')
axes[1].set_title('Standardized without outliers (contamination=0.1)')
plt.show()
fig, axes = plt.subplots(1, 2, figsize=(20, 5))
sns.boxplot(data=df_normalized['Attr29'], orient="h", ax=axes[0])
sns.boxplot(data=df_norm_outliers['Attr29'], orient="h", ax=axes[1])
axes[0].set_title('Attr29 with outliers (contamination=0.1) Boxplot')
axes[1].set_title('Attr29 inliers (contamination=0.1) Boxplot')
axes[0].set_xlabel('Values')
axes[0].set_ylabel('Attr29')
axes[1].set_xlabel('Values')
axes[1].set_ylabel('Attr29')
plt.show()
Isolation Forest 0.3¶
from sklearn.ensemble import IsolationForest
clf = IsolationForest(random_state=0, contamination=0.3)
clf.fit(df_normalized)
y_pred = clf.predict(df_normalized)
df_norm_outliers = df_normalized.copy()
df_norm_outliers['outliers'] = y_pred
df_norm_outliers['outliers'].value_counts()
outliers = df_norm_outliers[df_norm_outliers['outliers'] == -1]
df_norm_inliers = df_norm_outliers[df_norm_outliers['outliers'] == 1]
print(f"Number of outliers: {outliers.shape[0]}")
print(f"Number of inliers: {df_norm_inliers.shape[0]}")
print(df_norm_outliers[df_norm_outliers['outliers'] == -1]['class'].value_counts())
df_norm_outliers = df_norm_outliers[df_norm_outliers['outliers'] == 1]
df_norm_outliers = df_norm_outliers.drop('outliers', axis=1)
df_norm_inliers = df_norm_inliers.drop('outliers', axis=1)
Number of outliers: 2938 Number of inliers: 6854 class b'0' 2647 b'1' 291 Name: count, dtype: int64
fig, axes = plt.subplots(1, 2, figsize=(20, 35))
sns.boxplot(data=df_normalized, orient="h", ax=axes[0])
sns.boxplot(data=df_norm_inliers, orient="h", ax=axes[1])
axes[0].set_title('Normalized with outliers (contamination=0.3)')
axes[1].set_title('Normalized without outliers (contamination=0.3)')
plt.show()
fig, axes = plt.subplots(1, 2, figsize=(20, 5))
sns.boxplot(data=df_normalized['Attr1'], orient="h", ax=axes[0])
sns.boxplot(data=df_norm_inliers['Attr1'], orient="h", ax=axes[1])
axes[0].set_title('Attr1 with outliers Boxplot')
axes[1].set_title('Attr1 inliers (contamination=0.3) Boxplot')
axes[0].set_xlabel('Values')
axes[0].set_ylabel('Attr1')
axes[1].set_xlabel('Values')
axes[1].set_ylabel('Attr1')
plt.show()
clf = IsolationForest(random_state=0, contamination=0.3)
clf.fit(df_standardized)
y_pred = clf.predict(df_standardized)
df_std_outliers = df_standardized.copy()
df_std_outliers['outliers'] = y_pred
df_std_outliers['outliers'].value_counts()
std_outliers = df_std_outliers[df_std_outliers['outliers'] == -1]
std_inliers = df_std_outliers[df_std_outliers['outliers'] == 1]
print(f"Number of outliers: {std_outliers.shape[0]}")
print(f"Number of inliers: {std_inliers.shape[0]}")
print(std_outliers['class'].value_counts())
df_std_outliers = df_std_outliers[df_std_outliers['outliers'] == 1]
std_inliers = std_inliers.drop('outliers', axis=1)
Number of outliers: 2938 Number of inliers: 6854 class b'0' 2646 b'1' 292 Name: count, dtype: int64
fig, axes = plt.subplots(1, 2, figsize=(20, 35))
sns.boxplot(data=df_standardized, orient="h", ax=axes[0])
sns.boxplot(data=std_inliers, orient="h", ax=axes[1])
axes[0].set_title('Standardized with outliers (contamination=0.3)')
axes[1].set_title('Standardized without outliers (contamination=0.3)')
plt.show()
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
df_norm_inliers['class'].value_counts().plot.pie(autopct='%1.1f%%', startangle=140, shadow=True, labels=['Non-bankrupt', 'Bankrupt'], ax=axes[0])
axes[0].set_title('Class Distribution without outliers (contamination=0.3)')
df_normalized['class'].value_counts().plot.pie(autopct='%1.1f%%', startangle=140, shadow=True, labels=['Non-bankrupt', 'Bankrupt'], ax=axes[1])
axes[1].set_title('Class Distribution with outliers')
plt.show()
Re-normalizing after removing the outliers:
For the later PCA, the features should be rescaled, because PCA is sensitive to the scale of the data.
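PCA's sensitivity to scale can be demonstrated directly: a feature with a much larger variance dominates the first principal component unless the data are rescaled first (toy sketch, unrelated to `df`):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0, 1, 500),     # unit-scale feature
    rng.normal(0, 1000, 500),  # same shape, a thousand times the scale
])

raw_ratio = PCA(n_components=1).fit(X).explained_variance_ratio_[0]
scaled_ratio = PCA(n_components=1).fit(
    StandardScaler().fit_transform(X)).explained_variance_ratio_[0]
print(f"unscaled:  PC1 explains {raw_ratio:.3f} of the variance")   # close to 1.0
print(f"rescaled:  PC1 explains {scaled_ratio:.3f} of the variance")  # close to 0.5
```

Without rescaling, PC1 is essentially the large-scale feature alone; after standardization the two features contribute roughly equally.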
df_norm_inliers_norm = df_norm_inliers.copy()
df_norm_inliers_norm = df_norm_inliers_norm.drop('class', axis=1)
df_norm_inliers_norm[df_norm_inliers_norm.columns] = normalizer.fit_transform(df_norm_inliers_norm)
df_norm_inliers_norm['class'] = df_norm_inliers['class']
fig, axes = plt.subplots(1, 2, figsize=(20, 35))
sns.boxplot(data=df_norm_outliers, orient="h", ax=axes[0])
sns.boxplot(data=df_norm_inliers_norm, orient="h", ax=axes[1])
axes[0].set_title('Normalized without outliers (contamination=0.3)')
axes[1].set_title('Re-normalized without outliers (contamination=0.3)')
plt.show()
# re-standardize the data after removing the outlying values
df_std_inliers = std_inliers.copy()
df_std_inliers = df_std_inliers.drop('class', axis=1)
df_std_inliers[df_std_inliers.columns] = scaler.fit_transform(df_std_inliers)
df_std_inliers['class'] = std_inliers['class']
fig, axes = plt.subplots(1, 2, figsize=(20, 35))
sns.boxplot(data=std_inliers, orient="h", ax=axes[0])
sns.boxplot(data=df_std_inliers, orient="h", ax=axes[1])
axes[0].set_title('Standardized without outliers (contamination=0.3)')
axes[1].set_title('Re-standardized without outliers (contamination=0.3)')
plt.show()
4. Analysis of the datasets using two dimensionality-reduction algorithms, e.g. PCA, t-SNE, UMAP. See references [7-11].¶
# split the data into features and labels
# original data
y = df['class']
x = df.drop('class', axis=1)
# normalized data
X_normalized = df_normalized.drop('class', axis=1)
# standardized data
X_standardized = df_standardized.drop('class', axis=1)
# normalized data with outliers removed
Y_norm_inliers = df_norm_inliers['class']
X_norm_inliers = df_norm_inliers_norm.drop('class', axis=1)
# standardized data with outliers removed
Y_std_inliers = df_std_inliers['class']
X_std_inliers = df_std_inliers.drop('class', axis=1)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
import numpy as np
pca = PCA()
pca.fit(x)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1
print(f"Number of components needed to explain 95% of variance for original data: {d}")
pca_normalized = PCA()
pca_normalized.fit(X_normalized)
cumsum_normalized = np.cumsum(pca_normalized.explained_variance_ratio_)
d_normalized = np.argmax(cumsum_normalized >= 0.95) + 1
print(f"Number of components needed to explain 95% of variance for normalized data: {d_normalized}")
pca_standardized = PCA()
pca_standardized.fit(X_standardized)
cumsum_standardized = np.cumsum(pca_standardized.explained_variance_ratio_)
d_standardized = np.argmax(cumsum_standardized >= 0.95) + 1
print(f"Number of components needed to explain 95% of variance for standardized data: {d_standardized}")
pca_norm_inliers = PCA()
pca_norm_inliers.fit(X_norm_inliers)
cumsum_norm_inliers = np.cumsum(pca_norm_inliers.explained_variance_ratio_)
d_norm_inliers = np.argmax(cumsum_norm_inliers >= 0.95) + 1
print(f"Number of components needed to explain 95% of variance for normalized data without outliers: {d_norm_inliers}")
pca_std_inliers = PCA()
pca_std_inliers.fit(X_std_inliers)
cumsum_std_inliers = np.cumsum(pca_std_inliers.explained_variance_ratio_)
d_std_inliers = np.argmax(cumsum_std_inliers >= 0.95) + 1
print(f"Number of components needed to explain 95% of variance for standardized data without outliers: {d_std_inliers}")
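The five near-identical fit-and-print blocks above could be collapsed into one helper. A NumPy-only sketch of the same 95% rule (the SVD route is equivalent to what `PCA` computes internally; the toy data below is assumed for the check, not the notebook's DataFrames):

```python
import numpy as np

def n_components_for(X, threshold=0.95):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `threshold` -- a NumPy equivalent
    of PCA().fit(X) followed by np.cumsum(explained_variance_ratio_)."""
    Xc = X - X.mean(axis=0)                  # PCA centers the data
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values
    ratios = s**2 / np.sum(s**2)             # explained-variance ratios
    return int(np.argmax(np.cumsum(ratios) >= threshold)) + 1

# Toy check: one dominant direction, so a single component suffices.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 10, 200), rng.normal(0, 0.1, 200)])
print(n_components_for(X))  # 1
```

In the notebook this would let a loop over a dict such as `{'original': x, 'normalized': X_normalized, ...}` replace the repeated blocks.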
plt.figure(figsize=(16, 5))
sns.barplot(x=np.arange(1, len(cumsum)+1), y=cumsum, label='Original', color='blue', alpha=0.5)
sns.barplot(x=np.arange(1, len(cumsum_normalized)+1), y=cumsum_normalized, label='Normalized', color='green', alpha=0.5)
sns.barplot(x=np.arange(1, len(cumsum_standardized)+1), y=cumsum_standardized, label='Standardized', color='orange', alpha=0.5)
sns.barplot(x=np.arange(1, len(cumsum_norm_inliers)+1), y=cumsum_norm_inliers, label='Normalized without outliers', color='red', alpha=0.5)
sns.barplot(x=np.arange(1, len(cumsum_std_inliers)+1), y=cumsum_std_inliers, label='Standardized without outliers', color='purple', alpha=0.5)
plt.axhline(y=0.95, color='r', linestyle='--')
plt.xlabel('Number of Components')
plt.ylabel('Explained Variance')
plt.title('Explained Variance by Number of Components')
plt.legend(loc='upper right')
plt.show()
Number of components needed to explain 95% of variance for original data: 4
Number of components needed to explain 95% of variance for normalized data: 25
Number of components needed to explain 95% of variance for standardized data: 28
Number of components needed to explain 95% of variance for normalized data without outliers: 16
Number of components needed to explain 95% of variance for standardized data without outliers: 27
Conclusion:
After normalization, 25 components are enough to explain 95% of the variance.
After standardization, as many as 28 components are needed to reach 95% of the cumulative variance.
Even so, let us visualize the data with PCA in two dimensions to see what our data look like.
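Instead of fitting a full PCA and scanning the cumulative sum by hand, scikit-learn accepts a float `n_components` in (0, 1) and keeps just enough components to explain that fraction of the variance. A sketch on toy data (assumed; in the notebook one would pass `X_normalized` or `X_standardized`):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Toy data (assumed): three directions with sharply decreasing variance.
X = rng.normal(size=(500, 3)) * np.array([10.0, 3.0, 0.01])

# A float in (0, 1) asks PCA to keep enough components to reach that
# explained-variance fraction; the chosen count is in n_components_.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(pca.n_components_, X_reduced.shape)
```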
print(x.info())
print(X_normalized.info())
print(X_standardized.info())
Condensed output of the three info() calls: each DataFrame (x, X_normalized, X_standardized) reports <class 'pandas.core.frame.DataFrame'>, RangeIndex: 9792 entries, 0 to 9791, and 63 non-null float64 columns (Attr1 through Attr64, with Attr27 absent), memory usage 4.7 MB.
pca = PCA(n_components=2)
df_pca = pca.fit_transform(x)
df_pca = pd.DataFrame(df_pca, columns=['PC1', 'PC2'])
df_pca['class'] = y
pca = PCA(n_components=2)
df_normalized_pca = pca.fit_transform(X_normalized)
df_normalized_pca = pd.DataFrame(data=df_normalized_pca, columns=[f'PC{i}' for i in range(1, 3)])
#add class column
df_normalized_pca['class'] = df['class']
pca = PCA(n_components=2)
df_standardized_pca = pca.fit_transform(X_standardized)
df_standardized_pca = pd.DataFrame(data=df_standardized_pca, columns=[f'PC{i}' for i in range(1, 3)])
#add class column
df_standardized_pca['class'] = df['class']
pca = PCA(n_components=2)
df_norm_inliers_pca = pca.fit_transform(X_norm_inliers)
df_norm_inliers_pca = pd.DataFrame(data=df_norm_inliers_pca, columns=[f'PC{i}' for i in range(1, 3)])
#add class column
df_norm_inliers_pca['class'] = Y_norm_inliers.values
pca = PCA(n_components=2)
df_std_inliers_pca = pca.fit_transform(X_std_inliers)
df_std_inliers_pca = pd.DataFrame(data=df_std_inliers_pca, columns=[f'PC{i}' for i in range(1, 3)])
#add class column
df_std_inliers_pca['class'] = Y_std_inliers.values
print(df_std_inliers_pca['class'].value_counts())
fig, axes = plt.subplots(1, 5, figsize=(16, 5))
sns.scatterplot(data=df_pca, x='PC1', y='PC2', hue='class', ax=axes[0])
axes[0].set_title('PCA - Original Data')
sns.scatterplot(data=df_normalized_pca, x='PC1', y='PC2', hue='class', ax=axes[1])
axes[1].set_title('Normalized data')
sns.scatterplot(data=df_standardized_pca, x='PC1', y='PC2', hue='class', ax=axes[2])
axes[2].set_title('Standardized data')
sns.scatterplot(data=df_norm_inliers_pca, x='PC1', y='PC2', hue='class', ax=axes[3])
axes[3].set_title('Normalized without outliers')
sns.scatterplot(data=df_std_inliers_pca, x='PC1', y='PC2', hue='class', ax=axes[4])
axes[4].set_title('Standardized without outliers')
fig.tight_layout()
plt.show()
class
b'0'    6631
b'1'     223
Name: count, dtype: int64
pca = PCA(n_components=24)
df_normalized_pca = pca.fit_transform(X_normalized)
df_normalized_pca = pd.DataFrame(data=df_normalized_pca, columns=[f'PC{i}' for i in range(1, 25)])
#add class column
df_normalized_pca['class'] = df['class']
pca = PCA(n_components=28)
df_standardized_pca = pca.fit_transform(X_standardized)
df_standardized_pca = pd.DataFrame(data=df_standardized_pca, columns=[f'PC{i}' for i in range(1, 29)])
#add class column
df_standardized_pca['class'] = df['class']
pca_norm_inliers = PCA(n_components=16)
df_norm_inliers_pca = pca_norm_inliers.fit_transform(X_norm_inliers)
df_norm_inliers_pca = pd.DataFrame(data=df_norm_inliers_pca, columns=[f'PC{i}' for i in range(1, 17)])
#add class column
df_norm_inliers_pca['class'] = Y_norm_inliers.values
pca_std_inliers = PCA(n_components=27)
df_std_inliers_pca = pca_std_inliers.fit_transform(X_std_inliers)
df_std_inliers_pca = pd.DataFrame(data=df_std_inliers_pca, columns=[f'PC{i}' for i in range(1, 28)])
#add class column
df_std_inliers_pca['class'] = Y_std_inliers.values
import time
from sklearn.manifold import TSNE
time_start = time.time()
perplexity = 40
n_iter = 500
random_state = 42
fig_height = 10
fig_width = 16
tsne = TSNE(n_components=2, verbose=1, perplexity=perplexity, n_iter=n_iter, random_state=random_state)
tsne_standarized = tsne.fit_transform(X_standardized)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
tsne_standarized = pd.DataFrame(data=tsne_standarized, columns=['t-SNE1', 't-SNE2'])
# plot
plt.figure(figsize=(fig_width, fig_height))
sns.scatterplot(data=tsne_standarized, x='t-SNE1', y='t-SNE2', hue=df['class'], palette=sns.color_palette("hls", 2), alpha=0.5)
plt.title(f't-SNE standardized data (perplexity={perplexity}, n_iter={n_iter})')
plt.show()
[t-SNE] Computing 121 nearest neighbors... Indexed 9792 samples in 0.003s... Computed neighbors and conditional probabilities for 9792 / 9792 samples
[t-SNE] Mean sigma: 0.156631
[t-SNE] KL divergence after 250 iterations with early exaggeration: 80.530121
[t-SNE] KL divergence after 500 iterations: 1.733346
t-SNE done! Time elapsed: 37.02339291572571 seconds
[t-SNE] Computing 121 nearest neighbors... Indexed 9792 samples in 0.005s... Computed neighbors and conditional probabilities for 9792 / 9792 samples
[t-SNE] Mean sigma: 0.002874
[t-SNE] KL divergence after 250 iterations with early exaggeration: 71.813110
[t-SNE] KL divergence after 500 iterations: 1.390919
t-SNE done! Time elapsed: 84.31460404396057 seconds
time_start = time.time()
perplexity = 2
n_iter = 500
fig_height = 10
fig_width = 16
tsne = TSNE(n_components=2, verbose=1, perplexity=perplexity, n_iter=n_iter, random_state=random_state)
tsne_normalized = tsne.fit_transform(X_normalized)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
tsne_normalized = pd.DataFrame(data=tsne_normalized, columns=['t-SNE1', 't-SNE2'])
# plot
plt.figure(figsize=(fig_width, fig_height))
sns.scatterplot(data=tsne_normalized, x='t-SNE1', y='t-SNE2', hue=df['class'], palette=sns.color_palette("hls", 2), alpha=0.5)
plt.title(f't-SNE normalized data (perplexity={perplexity}, n_iter={n_iter})')
plt.show()
[t-SNE] Computing 7 nearest neighbors... Indexed 9792 samples in 0.003s... Computed neighbors and conditional probabilities for 9792 / 9792 samples
[t-SNE] Mean sigma: 0.000791
[t-SNE] KL divergence after 250 iterations with early exaggeration: 95.807755
[t-SNE] KL divergence after 500 iterations: 2.487562
t-SNE done! Time elapsed: 56.609742641448975 seconds
time_start = time.time()
perplexity = 5
n_iter = 500
tsne = TSNE(n_components=2, verbose=1, perplexity=perplexity, n_iter=n_iter, random_state=random_state)
tsne_normalized = tsne.fit_transform(X_normalized)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
tsne_normalized = pd.DataFrame(data=tsne_normalized, columns=['t-SNE1', 't-SNE2'])
plt.figure(figsize=(16, 10))
sns.scatterplot(data=tsne_normalized, x='t-SNE1', y='t-SNE2', hue=df['class'], palette=sns.color_palette("hls", 2), alpha=0.5)
plt.title(f't-SNE normalized data (perplexity={perplexity}, n_iter={n_iter})')
plt.show()
[t-SNE] Computing 16 nearest neighbors... Indexed 9792 samples in 0.004s... Computed neighbors and conditional probabilities for 9792 / 9792 samples
[t-SNE] Mean sigma: 0.001513
[t-SNE] KL divergence after 250 iterations with early exaggeration: 89.871399
[t-SNE] KL divergence after 300 iterations: 3.805093
t-SNE done! Time elapsed: 27.69949507713318 seconds
time_start = time.time()
perplexity = 10
n_iter = 500
tsne = TSNE(n_components=2, verbose=1, perplexity=perplexity, n_iter=n_iter, random_state=random_state)
tsne_normalized = tsne.fit_transform(X_normalized)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
tsne_normalized = pd.DataFrame(data=tsne_normalized, columns=['t-SNE1', 't-SNE2'])
plt.figure(figsize=(16, 10))
sns.scatterplot(data=tsne_normalized, x='t-SNE1', y='t-SNE2', hue=df['class'], palette=sns.color_palette("hls", 2), alpha=0.5)
plt.title(f't-SNE normalized data (perplexity={perplexity}, n_iter={n_iter})')
plt.show()
[t-SNE] Computing 31 nearest neighbors... Indexed 9792 samples in 0.005s... Computed neighbors and conditional probabilities for 9792 / 9792 samples
[t-SNE] Mean sigma: 0.001920
[t-SNE] KL divergence after 250 iterations with early exaggeration: 83.811539
[t-SNE] KL divergence after 500 iterations: 1.912423
t-SNE done! Time elapsed: 61.91113305091858 seconds
time_start = time.time()
perplexity = 20
n_iter = 500
tsne = TSNE(n_components=2, verbose=1, perplexity=perplexity, n_iter=n_iter, random_state=random_state)
tsne_normalized = tsne.fit_transform(X_normalized)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
tsne_normalized = pd.DataFrame(data=tsne_normalized, columns=['t-SNE1', 't-SNE2'])
plt.figure(figsize=(16, 10))
sns.scatterplot(data=tsne_normalized, x='t-SNE1', y='t-SNE2', hue=df['class'], palette=sns.color_palette("hls", 2), alpha=0.5)
plt.title(f't-SNE normalized data (perplexity={perplexity}, n_iter={n_iter})')
plt.show()
[t-SNE] Computing 61 nearest neighbors... Indexed 9792 samples in 0.007s... Computed neighbors and conditional probabilities for 9792 / 9792 samples
[t-SNE] Mean sigma: 0.002349
[t-SNE] KL divergence after 250 iterations with early exaggeration: 77.870216
[t-SNE] KL divergence after 500 iterations: 1.658014
t-SNE done! Time elapsed: 85.2188286781311 seconds
time_start = time.time()
perplexity = 30
n_iter = 500
tsne = TSNE(n_components=2, verbose=1, perplexity=perplexity, n_iter=n_iter, random_state=random_state)
tsne_normalized = tsne.fit_transform(X_normalized)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
tsne_normalized = pd.DataFrame(data=tsne_normalized, columns=['t-SNE1', 't-SNE2'])
plt.figure(figsize=(16, 10))
sns.scatterplot(data=tsne_normalized, x='t-SNE1', y='t-SNE2', hue=df['class'], palette=sns.color_palette("hls", 2), alpha=0.5)
plt.title(f't-SNE normalized data (perplexity={perplexity}, n_iter={n_iter})')
plt.show()
[t-SNE] Computing 91 nearest neighbors... Indexed 9792 samples in 0.006s... Computed neighbors and conditional probabilities for 9792 / 9792 samples
[t-SNE] Mean sigma: 0.002639
[t-SNE] KL divergence after 250 iterations with early exaggeration: 74.390587
[t-SNE] KL divergence after 500 iterations: 1.499874
t-SNE done! Time elapsed: 61.678648710250854 seconds
time_start = time.time()
perplexity = 40
n_iter = 500
tsne = TSNE(n_components=2, verbose=1, perplexity=perplexity, n_iter=n_iter, random_state=random_state)
tsne_normalized = tsne.fit_transform(X_normalized)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
tsne_normalized = pd.DataFrame(data=tsne_normalized, columns=['t-SNE1', 't-SNE2'])
plt.figure(figsize=(16, 10))
sns.scatterplot(data=tsne_normalized, x='t-SNE1', y='t-SNE2', hue=df['class'], palette=sns.color_palette("hls", 2), alpha=0.5)
plt.title(f't-SNE normalized data (perplexity={perplexity}, n_iter={n_iter})')
plt.show()
[t-SNE] Computing 121 nearest neighbors... Indexed 9792 samples in 0.004s... Computed neighbors and conditional probabilities for 9792 / 9792 samples
[t-SNE] Mean sigma: 0.002874
[t-SNE] KL divergence after 250 iterations with early exaggeration: 71.813011
[t-SNE] KL divergence after 500 iterations: 1.390330
t-SNE done! Time elapsed: 66.27242016792297 seconds
time_start = time.time()
perplexity = 50
n_iter = 500
tsne = TSNE(n_components=2, verbose=1, perplexity=perplexity, n_iter=n_iter, random_state=random_state)
tsne_normalized = tsne.fit_transform(X_normalized)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
tsne_normalized = pd.DataFrame(data=tsne_normalized, columns=['t-SNE1', 't-SNE2'])
plt.figure(figsize=(16, 10))
sns.scatterplot(data=tsne_normalized, x='t-SNE1', y='t-SNE2', hue=df['class'], palette=sns.color_palette("hls", 2), alpha=0.5)
plt.title(f't-SNE normalized data (perplexity={perplexity}, n_iter={n_iter})')
plt.show()
[t-SNE] Computing 151 nearest neighbors... Indexed 9792 samples in 0.004s... Computed neighbors and conditional probabilities for 9792 / 9792 samples
[t-SNE] Mean sigma: 0.003077
[t-SNE] KL divergence after 250 iterations with early exaggeration: 69.729156
[t-SNE] KL divergence after 500 iterations: 1.298629
t-SNE done! Time elapsed: 72.84613847732544 seconds
time_start = time.time()
perplexity = 70
n_iter = 500
tsne = TSNE(n_components=2, verbose=1, perplexity=perplexity, n_iter=n_iter, random_state=random_state)
tsne_normalized = tsne.fit_transform(X_normalized)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
tsne_normalized = pd.DataFrame(data=tsne_normalized, columns=['t-SNE1', 't-SNE2'])
plt.figure(figsize=(16, 10))
sns.scatterplot(data=tsne_normalized, x='t-SNE1', y='t-SNE2', hue=df['class'], palette=sns.color_palette("hls", 2), alpha=0.5)
plt.title(f't-SNE normalized data (perplexity={perplexity}, n_iter={n_iter})')
plt.show()
[t-SNE] Computing 211 nearest neighbors... Indexed 9792 samples in 0.006s... Computed neighbors and conditional probabilities for 9792 / 9792 samples
[t-SNE] Mean sigma: 0.003429
[t-SNE] KL divergence after 250 iterations with early exaggeration: 66.480576
[t-SNE] KL divergence after 500 iterations: 1.156768
t-SNE done! Time elapsed: 77.62437748908997 seconds
time_start = time.time()
perplexity = 40
n_iter = 2000
tsne = TSNE(n_components=2, verbose=1, perplexity=perplexity, n_iter=n_iter, random_state=random_state)
tsne_normalized = tsne.fit_transform(X_normalized)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
tsne_normalized = pd.DataFrame(data=tsne_normalized, columns=['t-SNE1', 't-SNE2'])
plt.figure(figsize=(16, 10))
sns.scatterplot(data=tsne_normalized, x='t-SNE1', y='t-SNE2', hue=df['class'], palette=sns.color_palette("hls", 2), alpha=0.5)
plt.title(f't-SNE normalized data (perplexity={perplexity}, n_iter={n_iter})')
plt.show()
[t-SNE] Computing 121 nearest neighbors... Indexed 9792 samples in 0.004s... Computed neighbors and conditional probabilities for 9792 / 9792 samples
[t-SNE] Mean sigma: 0.002874
[t-SNE] KL divergence after 250 iterations with early exaggeration: 71.813110
[t-SNE] KL divergence after 2000 iterations: 1.085735
t-SNE done! Time elapsed: 223.09739184379578 seconds
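The seven near-identical perplexity cells above could be folded into one sweep. A minimal sketch with the embedding function injected, so the structure can be checked without actually running t-SNE; in the notebook `embed` would be `lambda X, p: TSNE(n_components=2, perplexity=p, n_iter=500, random_state=42).fit_transform(X)`:

```python
def perplexity_sweep(X, perplexities, embed):
    """Map each perplexity setting to its 2-D embedding."""
    return {p: embed(X, p) for p in perplexities}

# Hypothetical stand-in embedding: keep the first two feature columns.
toy_embed = lambda X, p: [row[:2] for row in X]

X = [[1, 2, 3], [4, 5, 6]]
results = perplexity_sweep(X, [2, 5, 10], toy_embed)
print(sorted(results))   # [2, 5, 10]
print(results[2])        # [[1, 2], [4, 5]]
```

Each resulting embedding could then be plotted in its own subplot rather than in a fresh figure per cell.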
time_start = time.time()
perplexity = 40
n_iter = 500
random_state = 42
fig_height = 10
fig_width = 16
tsne = TSNE(n_components=2, verbose=1, perplexity=perplexity, n_iter=n_iter, random_state=random_state)
tsne_standardized = tsne.fit_transform(X_standardized)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time() - time_start))
tsne_standardized = pd.DataFrame(data=tsne_standardized, columns=['t-SNE1', 't-SNE2'])
# plot
plt.figure(figsize=(fig_width, fig_height))
sns.scatterplot(data=tsne_standardized, x='t-SNE1', y='t-SNE2', hue=df['class'], palette=sns.color_palette("hls", 2), alpha=0.5)
plt.title(f't-SNE standardized data (perplexity={perplexity}, n_iter={n_iter})')
plt.show()
[t-SNE] Computing 121 nearest neighbors... (9792 samples)
[t-SNE] Mean sigma: 0.156631
[t-SNE] KL divergence after 250 iterations with early exaggeration: 80.530121
[t-SNE] KL divergence after 500 iterations: 1.733346
t-SNE done! Time elapsed: 135.85 seconds
time_start = time.time()
perplexity = 40
n_iter = 2000
random_state = 42
fig_height = 10
fig_width = 16
tsne = TSNE(n_components=2, verbose=1, perplexity=perplexity, n_iter=n_iter, random_state=random_state)
tsne_standardized = tsne.fit_transform(X_standardized)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time() - time_start))
tsne_standardized = pd.DataFrame(data=tsne_standardized, columns=['t-SNE1', 't-SNE2'])
# plot
plt.figure(figsize=(fig_width, fig_height))
sns.scatterplot(data=tsne_standardized, x='t-SNE1', y='t-SNE2', hue=df['class'], palette=sns.color_palette("hls", 2), alpha=0.5)
plt.title(f't-SNE standardized data (perplexity={perplexity}, n_iter={n_iter})')
plt.show()
[t-SNE] Computing 121 nearest neighbors... (9792 samples)
[t-SNE] Mean sigma: 0.156631
[t-SNE] KL divergence after 250 iterations with early exaggeration: 80.530121
[t-SNE] KL divergence after 2000 iterations: 1.509810
t-SNE done! Time elapsed: 461.44 seconds
t-SNE is a dimensionality-reduction algorithm used to visualize data in two or three dimensions and to get a feel for its structure. In contrast to PCA, t-SNE tries to preserve local structure (clusters) rather than global structure: it keeps neighboring points close together in the embedding. It focuses on pairwise distances and keeps points belonging to the same cluster near each other, but it guarantees neither the distances between clusters nor the distances between far-apart points. t-SNE should therefore be run several times with different values of perplexity and different numbers of iterations. The Student's t-distribution used for the low-dimensional similarities helps mitigate the "crowding problem". Because t-SNE spreads out dense clusters and contracts sparse ones, interpreting the "size" of clusters or the gaps between them may be misleading. Here we tried to stabilize the layout by increasing the perplexity and then choosing a suitable number of iterations; unfortunately, a satisfactory result was not obtained.
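The multi-run advice above can be expressed as a small perplexity sweep. This is a minimal sketch on synthetic data (`X_demo` stands in for `X_standardized` from the cells above; the grid values are illustrative), comparing the final KL divergence of each run:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 10))  # stand-in for X_standardized

# run t-SNE for several perplexity values; lower final KL divergence means
# the embedding reproduces the high-dimensional neighborhoods more faithfully
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42, init='pca')
    emb = tsne.fit_transform(X_demo)
    print(perplexity, emb.shape, round(float(tsne.kl_divergence_), 3))
```

Note that runs with different perplexity values are not directly comparable by KL divergence alone; the sweep is mainly useful for checking whether the cluster layout stays qualitatively stable.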
5. Running a selected classification or clustering model. Comparative analysis of the results and of the decisions made while preparing the data for modelling¶
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, ConfusionMatrixDisplay
random_state = 42
def evaluate_model(data):
    model = DecisionTreeClassifier(random_state=random_state)
    # the arff loader yields the class label as a non-integer type
    if data['class'].dtype != 'int':
        data['class'] = data['class'].astype('int')
    y = data['class']
    data = data.drop('class', axis=1)
    X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.3, random_state=random_state)
    print(f"Model: {model.__class__.__name__}")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    cr = classification_report(y_test, y_pred, output_dict=True)
    return cm, cr
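Since `evaluate_model` returns `classification_report(..., output_dict=True)`, individual metrics can be read straight out of the nested dict, which matters here because the bankrupt class is a small minority (515 of 9792). A minimal sketch on toy labels (the label values are illustrative):

```python
from sklearn.metrics import classification_report

# toy labels: class 1 plays the role of the minority ("bankrupt") class
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]

cr = classification_report(y_true, y_pred, output_dict=True)
# per-class metrics are keyed by the stringified label
print(cr['1']['recall'])            # minority-class recall → 1.0
print(cr['macro avg']['f1-score'])  # macro-averaged F1
```

With this imbalance, the minority-class recall and the macro-averaged F1 are more informative than plain accuracy, which is dominated by the majority class.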
tsne_normalized['class'] = df['class']
df_orginal_cm, df_orginal_cr = evaluate_model(df)
df_normalized_cm, df_normalized_cr = evaluate_model(df_normalized)
df_standardized_cm, df_standardized_cr = evaluate_model(df_standardized)
df_norm_inliers_cm, df_norm_inliers_cr = evaluate_model(df_norm_inliers)
df_std_inliers_cm, df_std_inliers_cr = evaluate_model(df_std_inliers)
df_norm_pca_cm, df_norm_pca_cr = evaluate_model(df_normalized_pca)
df_std_pca_cm, df_std_pca_cr = evaluate_model(df_standardized_pca)
df_norm_inliers_pca_cm, df_norm_inliers_pca_cr = evaluate_model(df_norm_inliers_pca)
df_std_inliers_pca_cm, df_std_inliers_pca_cr = evaluate_model(df_std_inliers_pca)
df_tsne_normalized_cm, df_tsne_normalized_cr = evaluate_model(tsne_normalized)
Model: DecisionTreeClassifier (printed 10 times, once per evaluate_model call)
# make a subplot of all confusion matrices
fig, axes = plt.subplots(2, 5, figsize=(20, 10))
ConfusionMatrixDisplay(df_orginal_cm).plot( values_format='d', ax=axes[0][0])
axes[0][0].set_title('Original Data')
ConfusionMatrixDisplay(df_normalized_cm).plot( values_format='d', ax=axes[0][1])
axes[0][1].set_title('Normalized Data')
ConfusionMatrixDisplay(df_standardized_cm).plot( values_format='d', ax=axes[0][2])
axes[0][2].set_title('Standardized Data')
ConfusionMatrixDisplay(df_norm_inliers_cm).plot( values_format='d', ax=axes[0][3])
axes[0][3].set_title('Normalized without outliers')
ConfusionMatrixDisplay(df_std_inliers_cm).plot( values_format='d', ax=axes[0][4])
axes[0][4].set_title('Standardized without outliers')
ConfusionMatrixDisplay(df_norm_pca_cm).plot( values_format='d', ax=axes[1][0])
axes[1][0].set_title('Normalized PCA')
ConfusionMatrixDisplay(df_std_pca_cm).plot( values_format='d', ax=axes[1][1])
axes[1][1].set_title('Standardized PCA')
ConfusionMatrixDisplay(df_tsne_normalized_cm).plot( values_format='d', ax=axes[1][2])
axes[1][2].set_title('t-SNE normalized data')
ConfusionMatrixDisplay(df_norm_inliers_pca_cm).plot( values_format='d', ax=axes[1][3])
axes[1][3].set_title('Normalized without outliers PCA')
ConfusionMatrixDisplay(df_std_inliers_pca_cm).plot( values_format='d', ax=axes[1][4])
axes[1][4].set_title('Standardized without outliers PCA')
fig.tight_layout()
plt.show()
# Conclusions
fig, axes = plt.subplots(2, 5, figsize=(20, 10))
sns.heatmap(pd.DataFrame(df_orginal_cr).iloc[:-1, :].T, annot=True, ax=axes[0][0], cmap='viridis')
axes[0][0].set_title('Original Data')
sns.heatmap(pd.DataFrame(df_normalized_cr).iloc[:-1, :].T, annot=True, ax=axes[0][1], cmap='viridis')
axes[0][1].set_title('Normalized Data')
sns.heatmap(pd.DataFrame(df_standardized_cr).iloc[:-1, :].T, annot=True, ax=axes[0][2], cmap='viridis')
axes[0][2].set_title('Standardized Data')
sns.heatmap(pd.DataFrame(df_norm_inliers_cr).iloc[:-1, :].T, annot=True, ax=axes[0][3], cmap='viridis')
axes[0][3].set_title('Normalized without outliers')
sns.heatmap(pd.DataFrame(df_std_inliers_cr).iloc[:-1, :].T, annot=True, ax=axes[0][4], cmap='viridis')
axes[0][4].set_title('Standardized without outliers')
sns.heatmap(pd.DataFrame(df_norm_pca_cr).iloc[:-1, :].T, annot=True, ax=axes[1][0], cmap='viridis')
axes[1][0].set_title('Normalized PCA')
sns.heatmap(pd.DataFrame(df_std_pca_cr).iloc[:-1, :].T, annot=True, ax=axes[1][1], cmap='viridis')
axes[1][1].set_title('Standardized PCA')
sns.heatmap(pd.DataFrame(df_tsne_normalized_cr).iloc[:-1, :].T, annot=True, ax=axes[1][2], cmap='viridis')
axes[1][2].set_title('t-SNE normalized data')
sns.heatmap(pd.DataFrame(df_norm_inliers_pca_cr).iloc[:-1, :].T, annot=True, ax=axes[1][3], cmap='viridis')
axes[1][3].set_title('Normalized without outliers PCA')
sns.heatmap(pd.DataFrame(df_std_inliers_pca_cr).iloc[:-1, :].T, annot=True, ax=axes[1][4], cmap='viridis')
axes[1][4].set_title('Standardized without outliers PCA')
fig.tight_layout()
plt.show()
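The ten heatmaps above are hard to rank visually. One way to compare the preprocessing variants side by side is to fold the report dicts into a single table; a minimal sketch (the `reports` dict here uses toy stand-ins, whereas in the notebook its values would be the `*_cr` dicts computed above):

```python
import pandas as pd
from sklearn.metrics import classification_report

# stand-ins for the *_cr dicts produced by evaluate_model
reports = {
    'original':   classification_report([0, 0, 1, 1], [0, 0, 1, 0], output_dict=True),
    'normalized': classification_report([0, 0, 1, 1], [0, 1, 1, 1], output_dict=True),
}

# one row per preprocessing variant: accuracy plus macro-averaged metrics
summary = pd.DataFrame({
    name: {
        'accuracy': cr['accuracy'],
        'macro precision': cr['macro avg']['precision'],
        'macro recall': cr['macro avg']['recall'],
        'macro f1': cr['macro avg']['f1-score'],
    }
    for name, cr in reports.items()
}).T
print(summary)
```

Sorting such a table by macro F1 makes it immediate which data-preparation decisions (normalization vs. standardization, outlier removal, PCA, t-SNE) actually helped the classifier.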